Extending k-means with the description-comes-first approach

Authors

  • Jerzy Stefanowski
  • Dawid Weiss
Abstract

This paper describes a technique for clustering large collections of short and medium-length text documents, such as press articles and news stories. The technique, called description comes first (DCF), consists of three steps: identifying related document clusters, selecting salient phrases relevant to these clusters, and reallocating documents that match the selected phrases to form the final document groups. Its advantages include more comprehensive cluster labels and a clearer (more transparent) relationship between cluster labels and their content. We demonstrate DCF by taking a standard k-means algorithm as a baseline and weaving DCF elements into it; the outcome is the descriptive k-means (DKM) algorithm. The paper goes through the technical background, explaining how to implement DKM efficiently, and ends with a description of an experiment measuring clustering quality on the 20-newsgroups benchmark document collection. Short fragments of this paper appeared at the poster session of the RIAO 2007 conference, Pittsburgh, PA, USA (electronic proceedings only).
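The three DCF steps named in the abstract can be illustrated with a small, hedged sketch. This is not the authors' DKM implementation: the function names (`tokenize`, `kmeans`, `candidate_phrases`, `descriptive_kmeans`), the use of word bigrams as candidate phrases, raw term counts with cosine similarity, and evenly spaced seed documents are all assumptions made only to keep the example short and deterministic.

```python
from collections import Counter
import math

def tokenize(text):
    # Assumption: lowercase whitespace tokens, alphabetic words only.
    return [w for w in text.lower().split() if w.isalpha()]

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    # Step 1: find related document clusters with a plain k-means loop.
    # Assumption: evenly spaced documents serve as seeds (requires k <= len(vectors)).
    step = max(1, len(vectors) // k)
    centroids = [Counter(vectors[i * step]) for i in range(k)]
    for _ in range(iterations):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            total = Counter()
            for v, a in zip(vectors, assign):
                if a == c:
                    total.update(v)
            if total:
                # A summed vector works as a centroid because cosine ignores scale.
                centroids[c] = total
    return centroids

def candidate_phrases(token_lists, min_docs=2):
    # Assumption: candidate labels are bigrams found in at least min_docs documents.
    df = Counter()
    for tokens in token_lists:
        df.update(set(zip(tokens, tokens[1:])))
    return sorted(p for p, n in df.items() if n >= min_docs)

def descriptive_kmeans(docs, k):
    token_lists = [tokenize(d) for d in docs]
    centroids = kmeans([Counter(t) for t in token_lists], k)
    phrases = candidate_phrases(token_lists)
    groups = {}
    if not phrases:
        return groups
    for centroid in centroids:
        # Step 2: select the phrase most similar to the cluster centroid.
        label = max(phrases, key=lambda p: cosine(Counter(p), centroid))
        # Step 3: reallocate documents that actually contain the phrase.
        members = [i for i, tokens in enumerate(token_lists)
                   if label in set(zip(tokens, tokens[1:]))]
        groups[" ".join(label)] = members
    return groups

DOCS = [
    "stock market prices fell sharply today",
    "investors watch stock market closely",
    "football team wins the final match",
    "the football team lost a close game",
]
groups = descriptive_kmeans(DOCS, 2)
```

In this sketch the final groups are defined by the selected phrases rather than by the k-means assignment, so every document in a group is guaranteed to contain its label. Two clusters that select the same phrase collapse into a single group here, which is a simplification of this example.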


Similar resources

A hybrid DEA-based K-means and invasive weed optimization for facility location problem

In this paper, instead of the classical approach to the multi-criteria location selection problem, a new approach was presented based on selecting a portfolio of locations. First, the indices affecting the selection of maintenance stations were collected. The K-means model was used for clustering the maintenance stations. The optimal number of clusters was calculated through the Silhou...


Organizational Learning and Knowledge Spillover in Innovation Networks: Agent-Based Approach (Extending SKIN Framework)

In a knowledge-based economy, knowledge has a public-good and non-rival nature. Firms build their own knowledge stock not only by means of internal R&D and collaboration with partners, but also by means of knowledge previously spilled over from other firms and public research laboratories (such as universities). Firms based on their absorptive capacity, and level of intra-industry and extra-indus...


Providing and extending equal opportunities and educational justice in education in Isfahan Province

Plan: The present research studied the current and desired strategies for facing the challenge of providing and extending equal opportunity and educational justice in education in Isfahan. Method: This research studied the current and desired strategies for providing and extending equal opportunity and educational justice in education in Isfahan. 126 people were selected as statisti...


Proposing an approach to calculate headway intervals to improve bus fleet scheduling using a data mining algorithm

The growth of AVL (Automatic Vehicle Location) systems leads to a huge amount of data about different parts of the bus fleet (buses, stations, passengers, etc.), which is very useful for improving bus fleet efficiency. In addition, by processing the fleet's and passengers' historical data it is possible to detect passengers' behavioral patterns in different parts of the day and to use it in order to improve fle...


Publication date: 2008